Visualizing statistics associated with taxa can be difficult because the information is hierarchical. Traditional graph types used to visualize the relationship between categories (e.g. taxa) and quantities (e.g. abundance), such as bar charts, pie charts, and box plots, are fundamentally two-dimensional. This means that it is usually necessary to view only a ‘slice’ of the data, such as the abundance of taxa of a particular rank, rather than all ranks at once. To demonstrate this idea, let's display the same data using various graphing techniques and compare their effectiveness. For this example we will be using a sample of 500 sequences from the UNITE fungal ITS database. First, we will use a bar chart to display the abundance of taxa at each rank:
For ranks with relatively few taxa, this is a satisfactory graphing technique, but it becomes ineffective once there are more than 20 or so taxa. It is also difficult to discern how subtaxa are distributed within each taxon. For example, if we look only at the phylum or class level, we can easily see that Basidiomycota/Agaricomycetes are the most plentiful, but we cannot tell whether that is due to a single highly abundant species or many moderately abundant species. These details are typically important for the interpretation of results.
Pie charts are also commonly used for this purpose, but they are just a less effective version of a bar chart.
Metacoder approaches this problem by using size and color to represent numeric data distributed along a phylogenetic tree:
In this example, both size and color are used to represent the abundance of taxa. With this method, it is clear how subtaxa are distributed within their supertaxa and which taxa are unusual (note the unidentified Agaricales).
Although there are many options that can be used to make highly customized graphs, heat_tree only needs one argument to function: an object of type taxmap. We can see the default appearance of the data used in the introduction using the code below:
library(metacoder)
unite_ex_data_3
## `taxmap` object with data for 703 taxa and 500 observations:
##
## ----------------------------------------- taxa -----------------------------------------
## 1, 2, 3, 4, 5, 6, 7, 8, 9 ... 694, 695, 696, 697, 698, 699, 700, 701, 702, 703
##
## -------------------------------------- taxon_data --------------------------------------
## # A tibble: 703 × 4
##   taxon_ids supertaxon_ids unite_rank name
##   <chr>     <chr>          <chr>      <chr>
## 1 1         <NA>           k          Fungi
## 2 2         1              p          Ascomycota
## 3 3         1              p          Basidiomycota
## 4 4         1              p          Chytridiomycota
## 5 5         1              p          Glomeromycota
## 6 6         1              p          unidentified
## 7 7         1              p          Zygomycota
## # ... with 696 more rows
##
## --------------------------------------- obs_data ---------------------------------------
## # A tibble: 500 × 5
##   obs_taxon_ids seq_name                 seq_id   other_id
##   <chr>         <chr>                    <chr>    <chr>
## 1 183           Lachnum_sp               JQ347180 SH189775.06FU
## 2 175           Lachnellula_calyciformis U59145   SH189776.06FU
## 3 183           Lachnum_sp               AM084756 SH189777.06FU
## 4 183           Lachnum_sp               FM172814 SH189778.06FU
## 5 183           Lachnum_sp               FN539058 SH189779.06FU
## 6 181           Lachnum_pulverulentum    AB481260 SH189780.06FU
## 7 183           Lachnum_sp               HQ211694 SH189781.06FU
## # ... with 493 more rows, and 1 more variables: sequence <chr>
##
## ------------------------------------- taxon_funcs -------------------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
heat_tree(unite_ex_data_3)
Each node (i.e. circle) in the graph represents a taxon and each line represents its membership in a supertaxon.
The size of nodes and lines can be scaled to a number associated with each taxon using the node_size and edge_size parameters. Below, the number of sequences for each taxon is used to determine node size.
heat_tree(unite_ex_data_3,
node_size = n_obs)
Note that it was not necessary to specify the absolute node size; the range of absolute node sizes is optimized for each graph so as to minimize overlap of nodes and maximize the range of sizes. The overlap_avoidance argument determines how strongly overlap is avoided. Higher values give more importance to avoiding overlapping nodes than to maximizing the range of sizes. A high overlap_avoidance makes the connections between taxa clearer, but diminishes the visual effect of node size. Too low an overlap_avoidance can make the graph hard to read.
heat_tree(unite_ex_data_3,
node_size = n_obs,
overlap_avoidance = 10)
heat_tree(unite_ex_data_3,
node_size = n_obs,
overlap_avoidance = 0.1)
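To make the meaning of the n_obs values used for node_size more concrete, here is a minimal base R sketch of how such counts can be derived from a taxonomy table and a vector of observation assignments. It assumes that n_obs counts the observations assigned to a taxon or to any of its subtaxa, while n_obs_1 counts only those assigned directly to the taxon. The toy data and the helper lineage are made up for illustration and are not part of metacoder.

```r
# Toy taxonomy using the same column names as the taxon_data table shown earlier.
# The taxa and observations are hypothetical.
toy_taxa <- data.frame(
  taxon_ids      = c("1", "2", "3", "4"),
  supertaxon_ids = c(NA, "1", "1", "2"),
  stringsAsFactors = FALSE
)
toy_obs_taxon_ids <- c("4", "4", "3", "2", "4")

# Observations assigned directly to each taxon (assumed analogue of n_obs_1)
direct <- table(factor(toy_obs_taxon_ids, levels = toy_taxa$taxon_ids))
n_direct <- setNames(as.integer(direct), toy_taxa$taxon_ids)

# Hypothetical helper: a taxon ID followed by the IDs of all its supertaxa
lineage <- function(id) {
  ids <- character(0)
  while (!is.na(id)) {
    ids <- c(ids, id)
    id <- toy_taxa$supertaxon_ids[match(id, toy_taxa$taxon_ids)]
  }
  ids
}

# Observations in each taxon including its subtaxa (assumed analogue of n_obs):
# every observation is counted once for each taxon in its lineage
n_total <- setNames(integer(nrow(toy_taxa)), toy_taxa$taxon_ids)
for (obs_id in toy_obs_taxon_ids) {
  members <- lineage(obs_id)
  n_total[members] <- n_total[members] + 1L
}

print(n_direct)
print(n_total)
```

Because each observation is counted for every taxon in its lineage, the root always receives the total number of observations, which is why it tends to be the largest node when node_size = n_obs.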
The node_color argument works in a similar way to node_size: numeric values are translated to a range of colors. Below, the number of sequences for each taxon is used to determine color as well as size. The range of colors used can be set using the node_color_range argument. This argument takes a vector of colors in the form of names, hex color codes, or integers.
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs)
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs,
node_color_range = c("#FFFFFF", "darkorange3", "#4e567d", "gold"))
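The idea of translating numbers to a range of colors can be sketched in base R as follows. This is only an illustration of the general technique, assuming a simple linear mapping; it is not metacoder's internal code, which may transform the values differently. The function map_to_colors and the example counts are hypothetical.

```r
# Hypothetical helper (not part of metacoder): linearly map numbers onto a color ramp
map_to_colors <- function(values, color_range, n_steps = 100) {
  # Interpolate n_steps colors between the colors given in color_range
  palette <- grDevices::colorRampPalette(color_range)(n_steps)
  value_range <- range(values, na.rm = TRUE)
  if (diff(value_range) == 0) {
    # All values are identical: use the first color for everything
    return(rep(palette[1], length(values)))
  }
  # Rescale values to 0..1, then convert to an index into the palette
  scaled <- (values - value_range[1]) / diff(value_range)
  palette[1 + round(scaled * (n_steps - 1))]
}

counts <- c(1, 5, 20, 100)  # made-up abundances
node_colors <- map_to_colors(counts, c("#FFFFFF", "darkorange3", "#4e567d", "gold"))
print(node_colors)
```

The smallest value receives the first color in the range and the largest value receives the last, with intermediate values falling on the interpolated colors in between.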
As with size, the color of lines can be set independently of the color of nodes using edge_color, although the default behavior is for the lines to have the same color as the nodes. To color only the nodes, set the lines to a constant color, or vice versa.
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs,
edge_color = "grey")
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = "grey",
edge_color = n_obs)
You can also set the color palette used for the lines in the same way as you set it for the nodes, using the edge_color_range argument.
Labels can be added to nodes using the node_label option:
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs,
node_label = name)
Label sizes are proportional to node size by default. To avoid excessive crowding, only a limited number of labels is printed by default. This maximum is controlled by the node_label_max option:
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs,
node_label = name,
node_label_max = 5)
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs,
node_label = name,
node_label_max = 200)
Note that the labels are a special kind that scales with the size of the graph. This means that the text size will always be proportional to the graph size regardless of how big the graph is rendered. However, these special labels take more time to render, so printing too many of them can drastically slow the rendering of the graph.
Lines can be labeled as well using the edge_label option, which works similarly to the node_label option:
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs,
edge_label = name)
The default background color is transparent in order to make formatting posters and slideshows as flexible as possible. Other background colors can be specified using the background_color option:
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs,
background_color = "grey")
Plots can be saved using ggsave from the ggplot2 package or using the output_file option:
my_plot <- heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs)
ggplot2::ggsave("path/to/my/output.png", my_plot, bg = "transparent")
heat_tree(unite_ex_data_3,
node_size = n_obs,
node_color = n_obs,
output_file = "path/to/my/output.png")
Sometimes a taxonomy has multiple roots. This occurs when there is no single taxon to which all observations are assigned, such as “Eukaryota” when all your observations are associated with eukaryotes. metacoder plots taxonomies with multiple roots as multiple trees:
heat_tree(contaminants,
node_size = n_obs,
node_color = n_obs,
node_label = name,
tree_label = name,
layout = "fruchterman-reingold")
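To check whether a taxonomy has more than one root before plotting, one option is to look for taxa that have no supertaxon. The base R sketch below does this on a made-up table that uses the same column names as the taxon_data shown earlier; the taxa and the helper find_root are hypothetical and are not part of metacoder.

```r
# Toy taxonomy with two roots; the data are made up for illustration
toy_taxa <- data.frame(
  taxon_ids      = c("1", "2", "3", "4", "5"),
  supertaxon_ids = c(NA, "1", NA, "3", "3"),
  name           = c("Fungi", "Ascomycota", "Bacteria", "Firmicutes", "Proteobacteria"),
  stringsAsFactors = FALSE
)

# Roots are the taxa without a supertaxon
roots <- toy_taxa$name[is.na(toy_taxa$supertaxon_ids)]
print(roots)

# Hypothetical helper: follow supertaxon links upward until a root is reached
find_root <- function(id) {
  parent <- toy_taxa$supertaxon_ids[match(id, toy_taxa$taxon_ids)]
  if (is.na(parent)) id else find_root(parent)
}

# ID of the root (i.e. the tree) that each taxon belongs to
tree_of <- vapply(toy_taxa$taxon_ids, find_root, character(1))
print(tree_of)
```

Here two taxa lack a supertaxon, so a plot of this taxonomy would be expected to contain two separate trees.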
To see the long list of available plotting options, type ?heat_tree.